Groupwork - Text Mining

Group 9

2024-06-04

Text Mining Task

In this analysis, we delved into the tweet engagement patterns for Bern University of Applied Sciences (BFH) compared to other Swiss higher education institutions. We utilized various visual and textual data to identify key engagement metrics and content strategies that could optimize BFH’s social media performance.

Data Import & Analysis

In this area, we will load and analyze the data for the work in order to make decisions on how to proceed.

Data Import

We import the dataset Tweets_all.rda, which provides the tweets data frame used throughout this report. All filtering steps create new objects (for example filtered_tweets), so the original data remains unchanged.
We use several libraries to process the tasks and produce the required output.

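# Load the data set. load() restores the saved objects into the workspace and
# returns only their names, so its return value is not the data itself.
set.seed(123)
options(scipen=999)
loaded_objects <- load("Tweets_all.rda")  # the report expects this file to provide `tweets`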
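
Because load() behaves differently from functions such as read.csv(), a small sketch may help. It saves a toy data frame to a temporary file and reloads it. The object name toy_tweets and the temporary file are hypothetical and not part of the original analysis.

```r
# Toy example: what load() returns versus what it restores
toy_tweets <- data.frame(id = 1:3, lang = c("de", "fr", "de"))
tmp <- tempfile(fileext = ".rda")
save(toy_tweets, file = tmp)

rm(toy_tweets)                 # remove the object to show that load() restores it
restored_names <- load(tmp)    # character vector with the names of the restored objects

restored_names                 # "toy_tweets"
nrow(toy_tweets)               # the data frame itself is available again
```

The return value is useful for checking which objects a file contains, while the data is accessed under its original name.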


Data Exploration

This section gives a concise overview of the tweet data from the social media accounts of Swiss universities.
The dataset consists of 19,575 observations and 14 variables:

Time Range and Tweet Frequency:

  • The tweets range from September 29, 2009 to January 26, 2023, which indicates long-term use of Twitter.
  • The median tweet date is April 13, 2018, so half of the tweets were posted in roughly the last five years of a period spanning more than 13 years; tweet activity is concentrated in the later years.

Retweet and Favorite Counts:

  • Retweets range from 0 to 267 and likes from 0 to 188 per tweet.
  • The first quartile is 0 for both retweets and likes, and the median is 1 for retweets and 0 for likes, indicating that many tweets receive little to no engagement.
  • The in_reply_to_screen_name field shows that some tweets are replies to other users, which might indicate engagement or conversation strategies used by the universities.

ID and String Variables:

  • The id and id_str fields are technical identifiers for tweets; their wide numeric range reflects the long collection period.

Language and University Fields:

  • lang holds the language code of each tweet
  • university shows the abbreviation of the university

Temporal Patterns:

  • created_at, tweet_date, tweet_hour, and tweet_month provide detailed temporal data that can be analyzed to understand peak times of activity and seasonal or monthly trends in tweeting behavior.
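
The derived time columns can be reproduced from created_at. The following is a minimal sketch in base R on three timestamps that appear in the output of this report. How the original columns were actually computed is not documented, so the calls and the UTC time zone are assumptions, and the sketch stores tweet_date as a plain date rather than a date-time.

```r
# Derive date, month and hour-of-day columns from a timestamp (assumed to be UTC)
created_at <- as.POSIXct(
  c("2023-01-20 17:17:32", "2023-01-13 07:52:01", "2009-09-29 14:29:47"),
  tz = "UTC"
)

tweet_date     <- as.Date(created_at, tz = "UTC")                      # calendar day
tweet_month    <- as.Date(format(created_at, "%Y-%m-01", tz = "UTC"))  # first day of the month
timeofday_hour <- format(created_at, "%H", tz = "UTC")                 # hour of day as text

data.frame(created_at, tweet_date, tweet_month, timeofday_hour)
```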

Content Analysis

The word cloud represents the most frequently used words in the filtered tweets with high engagement (likes or retweets). Key observations include:

  • Frequent Terms: Larger words such as “bachelor,” “design,” “die,” “das,” “der,” and “amp” indicate higher occurrence.
  • Key Topics: “bachelor” for Bachelor’s programs or graduates; “design” for design courses or projects; “HSLU” (Hochschule Luzern).
  • General Terms: “schweiz,” “zeigen,” “nicht.”
  • Note: The term “amp” appears due to HTML encoding and is not meaningful.


# Display first six rows of 'tweets'
head(tweets)
## # A tibble: 6 × 14
##   created_at               id id_str            full_text in_reply_to_screen_n…¹
##   <dttm>                <dbl> <chr>             <chr>     <chr>
## 1 2023-01-20 17:17:32 1.62e18 1616469988369469… "Im MSc … <NA>
## 2 2023-01-13 07:52:01 1.61e18 1613790954737074… "Was bew… <NA>
## 3 2023-01-12 19:30:01 1.61e18 1613604227141537… "Was uns… <NA>
## 4 2023-01-12 08:23:00 1.61e18 1613436367169634… "Eine di… <NA>
## 5 2023-01-11 14:00:05 1.61e18 1613158809081450… "Wir gra… <NA>
## 6 2023-01-10 17:06:11 1.61e18 1612843252083834… "Unsere … <NA>
## # ℹ abbreviated name: ¹in_reply_to_screen_name
## # ℹ 9 more variables: retweet_count <int>, favorite_count <int>, lang <chr>,
## #   university <chr>, tweet_date <dttm>, tweet_minute <dttm>,
## #   tweet_hour <dttm>, tweet_month <date>, timeofday_hour <chr>
# Provide summary statistics
summary(tweets)
##    created_at                          id                     
##  Min.   :2009-09-29 14:29:47.0   Min.   :         4468752018  
##  1st Qu.:2015-01-28 15:07:41.5   1st Qu.: 560439073866000000  
##  Median :2018-04-13 13:26:56.0   Median : 984754806702000000  
##  Mean   :2017-12-09 15:26:50.7   Mean   : 939953703992000000  
##  3rd Qu.:2020-10-20 10:34:50.0   3rd Qu.:1318470720360000000  
##  Max.   :2023-01-26 14:49:31.0   Max.   :1618607065240000000  
##     id_str           full_text         in_reply_to_screen_name
##  Length:19575       Length:19575       Length:19575           
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  retweet_count     favorite_count       lang            university       
##  Min.   :  0.000   Min.   :  0.00   Length:19575       Length:19575      
##  1st Qu.:  0.000   1st Qu.:  0.00   Class :character   Class :character  
##  Median :  1.000   Median :  0.00   Mode  :character   Mode  :character  
##  Mean   :  1.289   Mean   :  1.37                                        
##  3rd Qu.:  2.000   3rd Qu.:  2.00                                        
##  Max.   :267.000   Max.   :188.00                                        
##    tweet_date                      tweet_minute                   
##  Min.   :2009-09-29 00:00:00.00   Min.   :2009-09-29 14:29:00.00  
##  1st Qu.:2015-01-28 00:00:00.00   1st Qu.:2015-01-28 15:07:00.00  
##  Median :2018-04-13 00:00:00.00   Median :2018-04-13 13:26:00.00  
##  Mean   :2017-12-09 02:25:45.00   Mean   :2017-12-09 15:26:24.68  
##  3rd Qu.:2020-10-20 00:00:00.00   3rd Qu.:2020-10-20 10:34:30.00  
##  Max.   :2023-01-26 00:00:00.00   Max.   :2023-01-26 14:49:00.00  
##    tweet_hour                      tweet_month         timeofday_hour    
##  Min.   :2009-09-29 14:00:00.00   Min.   :2009-09-01   Length:19575      
##  1st Qu.:2015-01-28 14:30:00.00   1st Qu.:2015-01-01   Class :character  
##  Median :2018-04-13 13:00:00.00   Median :2018-04-01   Mode  :character  
##  Mean   :2017-12-09 14:59:43.81   Mean   :2017-11-24                     
##  3rd Qu.:2020-10-20 10:00:00.00   3rd Qu.:2020-10-01                     
##  Max.   :2023-01-26 14:00:00.00   Max.   :2023-01-01

Data Manipulation

Next, we prepare the data for analysis.

Languages

Here, we calculate the frequency of each language present in the tweets dataset and sort these frequencies in descending order.
The output indicates that German (de) is the most common language with 14,474 occurrences, followed by Italian (it) with 1,865 and French (fr) with 1,792. English (en) comes next with 1,280 tweets. The frequencies of other languages, including rare and less commonly used ones, are also listed, showcasing the linguistic diversity in the dataset.

# Count the frequency of each language
lang_counts <- table(tweets$lang)

# Sort the language frequencies in descending order
sort(lang_counts, decreasing = TRUE)
## 
##    de    it    fr    en   qam   qme    es    ca    da    ro    nl    in    et 
## 14474  1865  1792  1280    35    21    19    10    10    10     9     7     6 
##   und    pt   zxx   art    lv    cy    fi    lt    no   qht    cs    eu    ht 
##     6     4     4     3     3     2     2     2     2     2     1     1     1 
##    ja    sv    tl    tr 
##     1     1     1     1


German, Italian, French and English are by far the most frequent languages. The remaining languages occur only rarely and are not among the most widely spoken languages in Switzerland, so we limit the dataset to these four.
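
To make the coverage of this restriction explicit, the counts from the frequency table can be converted to shares. This is only arithmetic on the reported numbers; the variable names are illustrative.

```r
# Language counts as reported in the frequency table
top_langs <- c(de = 14474, it = 1865, fr = 1792, en = 1280)
total_tweets <- 19575

round(100 * top_langs / total_tweets, 1)        # share of each language in percent
sum(top_langs)                                  # tweets kept after filtering
round(100 * sum(top_langs) / total_tweets, 1)   # share of the data set that is kept
```

The four languages together account for 19,411 of the 19,575 tweets, so about 99.2% of the data is retained.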

# Filter the DataFrame to keep only tweets in German, Italian, French and English
filtered_tweets <- tweets[tweets$lang %in% c("de", "it", "fr", "en"), ]

# Check the resulting language distribution
table(filtered_tweets$lang)
## 
##    de    en    fr    it 
## 14474  1280  1792  1865


This gives us the new summary of the dataset:

  • Number of Records: The total count of tweets has decreased from 19,575 to 19,411 because the 164 tweets in other languages were removed.
  • Date and Time: Minimal changes are reflected across the median and mean values.
  • Other Attributes: No significant changes are observed in the ranges.
# Provide summary statistics
summary(filtered_tweets)
##    created_at                           id                     
##  Min.   :2009-09-29 14:29:47.00   Min.   :         4468752018  
##  1st Qu.:2015-02-04 11:39:32.00   1st Qu.: 562923403041000000  
##  Median :2018-04-17 13:53:07.00   Median : 986210946744999936  
##  Mean   :2017-12-11 15:27:49.55   Mean   : 940675313339000064  
##  3rd Qu.:2020-10-20 11:09:15.50   3rd Qu.:1318479385120000000  
##  Max.   :2023-01-26 14:49:31.00   Max.   :1618607065240000000  
##     id_str           full_text         in_reply_to_screen_name
##  Length:19411       Length:19411       Length:19411           
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  retweet_count     favorite_count        lang            university       
##  Min.   :  0.000   Min.   :  0.000   Length:19411       Length:19411      
##  1st Qu.:  0.000   1st Qu.:  0.000   Class :character   Class :character  
##  Median :  1.000   Median :  0.000   Mode  :character   Mode  :character  
##  Mean   :  1.293   Mean   :  1.376                                        
##  3rd Qu.:  2.000   3rd Qu.:  2.000                                        
##  Max.   :267.000   Max.   :188.000                                        
##    tweet_date                     tweet_minute                   
##  Min.   :2009-09-29 00:00:00.0   Min.   :2009-09-29 14:29:00.00  
##  1st Qu.:2015-02-04 00:00:00.0   1st Qu.:2015-02-04 11:39:00.00  
##  Median :2018-04-17 00:00:00.0   Median :2018-04-17 13:53:00.00  
##  Mean   :2017-12-11 02:26:53.7   Mean   :2017-12-11 15:27:23.56  
##  3rd Qu.:2020-10-20 00:00:00.0   3rd Qu.:2020-10-20 11:09:00.00  
##  Max.   :2023-01-26 00:00:00.0   Max.   :2023-01-26 14:49:00.00  
##    tweet_hour                      tweet_month         timeofday_hour    
##  Min.   :2009-09-29 14:00:00.00   Min.   :2009-09-01   Length:19411      
##  1st Qu.:2015-02-04 11:30:00.00   1st Qu.:2015-02-01   Class :character  
##  Median :2018-04-17 13:00:00.00   Median :2018-04-01   Mode  :character  
##  Mean   :2017-12-11 15:00:42.28   Mean   :2017-11-26                     
##  3rd Qu.:2020-10-20 10:30:00.00   3rd Qu.:2020-10-01                     
##  Max.   :2023-01-26 14:00:00.00   Max.   :2023-01-01

Emojis

The package emo is used for emoji analysis in R, which is essential for text data that includes emojis. This is useful for cleaning data, extracting information, or preparing text for further analysis.
Understanding the prevalence of emojis can help analyze sentiment, user engagement, or cultural trends in social media data.

# Install the emo package from GitHub for emoji analysis
if (!require("emo")) {
  remotes::install_github("hadley/emo")
}
## Loading required package: emo
library(emo)
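
The later emoji tables rely on a list column emojis whose construction is not shown in this report. As a rough sketch, and explicitly not the emo-based approach used in the analysis, emojis can be pulled out of text with a base R regular expression. The Unicode ranges below are a simplification that covers common pictographs and symbols but not every emoji (flags, for instance, are not matched), and toy_text is made-up data.

```r
# Toy tweets containing emojis (written as Unicode escapes for portability)
toy_text <- c(
  "Congrats to our graduates \U0001F389\U0001F393",
  "No emoji here",
  "Read more \u27A1 on our website"
)

# Simplified emoji pattern: pictograph blocks plus miscellaneous symbols and dingbats
emoji_pattern <- "[\\x{1F300}-\\x{1FAFF}\\x{2600}-\\x{27BF}]"

# One character vector of emojis per tweet, comparable to a list column
emojis <- regmatches(toy_text, gregexpr(emoji_pattern, toy_text, perl = TRUE))

# Frequency of each emoji across all tweets
emoji_counts <- sort(table(unlist(emojis)), decreasing = TRUE)
emoji_counts
```

A dedicated emoji package handles multi-character sequences such as flags and skin-tone variants, which this simple character class does not.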

Tweet Analysis

In this section we will use the prepared data to analyze the tweets for frequency, interactions and universities.

Tweet Frequency Analysis

In this section we will analyze the tweets for frequency of Swiss universities.

Tweet Frequency over Time

The chart shows fluctuations in monthly tweet volumes per university over the years:

  • Universities like HSLU and ZHAW: Display prominent peaks at certain intervals, possibly indicating targeted social media campaigns or significant events that engaged the university community.
  • Other Universities (e.g., BFH, FHNW): Some show a steady level of activity with occasional spikes, while others might exhibit a decline or increase in activity, suggesting changes in social media strategy or external factors impacting engagement.
# Code to analyze tweet frequencies by time and university
p1<- filtered_tweets %>%
  mutate(tweet_month = floor_date(created_at, "month")) %>%
  group_by(university, tweet_month) %>%
  summarize(count = n(), .groups = 'drop') %>%
  ggplot(aes(x = tweet_month, y = count, fill = university)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Monthly Tweet Frequency by University", x = "Year", y = "Number of Tweets")

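# Convert to interactive plotly object (tooltip shows month, count and university;
# the plot defines no "text" aesthetic, so tooltip = "text" would leave the tooltip empty)
interactive_plot <- ggplotly(p1, tooltip = c("x", "y", "fill"))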

# Optionally, add configurations to enhance interaction
interactive_plot <- interactive_plot %>% layout(
  hovermode = 'closest',
  title = "Click on a University to see its Tweet Trends",
  showlegend = TRUE
)

interactive_plot

Tweet Frequency - Terms

Here we return terms that meet the high frequency threshold.

Text Preprocessing

We create a text corpus from filtered_tweets$clean_text, where each tweet is treated as a separate document.
The corpus serves as the foundational structure for text analysis, allowing for uniform processing and manipulation of the text data.

# Corpus: Collection of text documents that generally serves as a basis for analysis in text processing and text mining.
# VectorSource(tweets): This vector is then used as the source for the corpus, whereby each entry in the vector becomes a separate document in the corpus.
# It is important that the text is extracted, as the corpus should only work with text data.
corpus <- Corpus(VectorSource(filtered_tweets$clean_text))


Here we clean the corpus by converting all text to lowercase, removing punctuation, numbers, and German, French, Italian, and English stopwords, stripping extra spaces, and finally stemming the words.
Cleaning the text is crucial for reducing noise and focusing analyses on meaningful words only. This standardizes the text data, making subsequent analyses like topic modeling or sentiment analysis more effective and less prone to error due to textual inconsistencies.

# Clean text
corpus <- tm_map(corpus, content_transformer(tolower))  # Convert to lower case
corpus <- tm_map(corpus, removePunctuation)             # Removing punctuation marks
corpus <- tm_map(corpus, removeNumbers)                 # Removing numbers
corpus <- tm_map(corpus, removeWords, stopwords("german"))  # Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("french"))
corpus <- tm_map(corpus, removeWords, stopwords("italian"))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)               # Removal of additional spaces
corpus <- tm_map(corpus, stemDocument) #remove suffixes, etc.; only root form of the word

# Further clean the text by removing specific web/text symbols and terms
corpus <- tm_map(corpus, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x)
  x <- gsub("«", "", x)
  x <- gsub("»", "", x)
  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove 'rt', 'www', and 'emojiemoji'
  x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE)  # Remove the standalone token 'amp' left over from HTML-encoded '&'
  x <- gsub("http[s]?://\\S+", "", x)  # Remove URLs (punctuation was already stripped above, so "://" rarely remains)
  return(x)
}))
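
Word-cloud remnants such as "httpstcoezfrwxu", which are noted later in this report, are what one would expect when punctuation is stripped before the URL pattern is applied. The following base R sketch on a made-up tweet illustrates the effect of the ordering and of matching "amp" as a whole word only. It is an illustration under these assumptions and not a drop-in replacement for the tm pipeline above.

```r
toy_tweet <- "Campus news &amp; more: https://t.co/abc123"

# Order A: punctuation first, then the URL pattern (it can no longer match)
a <- gsub("[[:punct:]]", "", toy_tweet)
a <- gsub("http[s]?://\\S+", "", a)

# Order B: URLs and the HTML entity first, then punctuation and extra spaces
b <- gsub("http[s]?://\\S+", "", toy_tweet)
b <- gsub("&amp;", " ", b, fixed = TRUE)
b <- gsub("[[:punct:]]", "", b)
b <- trimws(gsub("\\s+", " ", b))

a   # URL fragments and "amp" survive
b   # only the words remain

# Matching "amp" without word boundaries also damages words that contain it
damaged <- gsub("amp", "", "campus news amp")
intact  <- gsub("\\bamp\\b", "", "campus news amp")
```

In order A the result still contains "amp" and "httpstcoabc123", while order B leaves "Campus news more".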


Here we create a Document-Term Matrix (DTM) from the corpus, applying additional filters such as punctuation removal and stop word exclusion during matrix construction. We then filter out terms that appear in less than 1% of the documents to reduce sparsity.
Reducing sparsity helps focus on terms that have significant presence across documents, enhancing the reliability and performance of statistical models and algorithms applied later.

# Create DTM and remove sparse terms
dtm1 <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm1 <- removeSparseTerms(dtm1, sparse = 0.99)  # Adjust sparsity threshold as needed
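
What the threshold sparse = 0.99 means can be shown on a toy document-term matrix in base R: a term is kept only if it occurs in more than roughly 1% of the documents. The matrix and the hand-written filter below are illustrative stand-ins that approximate the behaviour of removeSparseTerms(); they are not the tm implementation.

```r
# Toy document-term matrix: 200 documents (rows) and 3 terms (columns)
n_docs <- 200
toy_dtm <- matrix(0, nrow = n_docs, ncol = 3,
                  dimnames = list(NULL, c("studium", "forschung", "selten")))
toy_dtm[1:50, "studium"]  <- 1   # appears in 25% of the documents
toy_dtm[1:3, "forschung"] <- 1   # appears in 1.5% of the documents
toy_dtm[1, "selten"]      <- 1   # appears in 0.5% of the documents

sparse <- 0.99
doc_freq <- colSums(toy_dtm > 0)            # number of documents containing each term
keep <- doc_freq > n_docs * (1 - sparse)    # keep terms present in more than 1% of documents
kept_terms <- colnames(toy_dtm)[keep]
kept_terms
```

With 19,411 tweets, the same rule would require a term to occur in roughly 195 or more tweets to be kept.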


Terms Analysis:

  • Dominant Themes: Words like “schweizer” (Swiss), “unternehmen” (companies), “zukunft” (future), “innov” (innovation), and “digital” suggest that the text data heavily revolves around themes of Swiss companies, innovation, and digital advancements.
  • Common Words: Frequent appearance of terms like “dank” (thanks), “neue” (new), “mehr” (more), and “info” indicates common communication patterns possibly related to news dissemination or updates about new developments and initiatives.
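set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq1 <- sort(colSums(as.matrix(dtm1)), decreasing = TRUE)
top_word_freq1 <- head(word_freq1, 80)
word_names1 <- names(top_word_freq1)  # term labels in the same order as the frequencies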

# Generate word cloud using the correct word names
wordcloud(
  words = word_names1, 
  freq = top_word_freq1, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)
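
In a document-term matrix, rows are documents and columns are terms, so term frequencies come from column sums while row sums give the length of each document. A small base R sketch with made-up counts shows the difference; the matrix is hypothetical.

```r
# Toy document-term matrix: 3 tweets (rows) x 3 terms (columns)
toy_dtm <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    3, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("tweet1", "tweet2", "tweet3"), c("digital", "design", "lab"))
)

term_freq  <- sort(colSums(toy_dtm), decreasing = TRUE)  # how often each term occurs
doc_length <- rowSums(toy_dtm)                           # how many terms each tweet contains

term_freq
names(term_freq)   # term labels in the same order as the frequencies
doc_length
```

Passing names(term_freq) and term_freq together to a plotting function keeps every word paired with its own frequency.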

Tweets Frequency - Emojis

  • Engagement Strategy: The frequent use of pointing emojis like ➡️ and 👉, together with the attention marker ❇️, suggests that guiding readers to additional content or important links is a common strategy.
  • Content Themes: Emojis like 📖, 🔎, 💻, and 💡 highlight the focus on education, research, and technology.
  • Celebratory Communication: Emojis such as 👏, 🎉, 🎓, and 🥳 signify celebration and achievement.
# Analyze the frequency of different emojis and select the top 50
emoji_freq2 <- table(unlist(filtered_tweets$emojis))
sort(emoji_freq2, decreasing = TRUE)[1:50]
## 
##   ➡️   ❇️   👉   📖   🔎   🇨🇭   💡   💻   πŸ‘   🎉   📣   🚀   ✨   🎬   💛   🔬 
##  414  247  180  117   97   75   67   67   65   63   57   56   45   44   38   36 
##   🤖   🆕   📅   🖤   🚨   🎓   🎙️   🎄   😃   πŸ†   👇   📸   πŸ‘   💪   ⚡   🌱 
##   36   35   32   32   32   30   28   26   26   25   23   23   22   22   21   21 
## 👩‍🎓   ▶️   🌍   πŸ…   ☀️ 👨‍🎓   🙌   🌳   🥂   🥳   🍾   πŸ“   📢   🔋   😉   🀝 
##   21   20   20   20   19   19   19   18   18   18   17   17   17   17   17   17 
##   📚   😎 
##   16   16

High Engagement

In this section, we want to focus on tweets that have attracted more attention and interaction.

High Engagement - Terms

Text Preprocessing:

This section sets a variable engagement_threshold to 20, the minimum number of likes or retweets a tweet must have to be considered as having “high engagement”. This threshold helps to focus on tweets that have garnered more attention and interaction.

# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold <- 20

# Filter tweets based on this engagement threshold
high_engagement_tweets <- filtered_tweets %>%
  filter(favorite_count >= engagement_threshold | retweet_count >= engagement_threshold)


For high_engagement_tweets we apply the same cleaning steps as before: we convert all text to lowercase, remove punctuation, numbers, and German, French, Italian, and English stopwords, strip extra spaces, and stem the words. We then create a Document-Term Matrix (DTM) from this corpus.

# Rebuild the corpus with the sampled data
corpus2 <- Corpus(VectorSource(high_engagement_tweets$clean_text)) 

corpus2 <- tm_map(corpus2, content_transformer(tolower))  # Convert to lower case
corpus2 <- tm_map(corpus2, removePunctuation)             # Removing punctuation marks
corpus2 <- tm_map(corpus2, removeNumbers)                 # Removing numbers
corpus2 <- tm_map(corpus2, removeWords, stopwords("german"))  # Removing stop words
corpus2 <- tm_map(corpus2, removeWords, stopwords("french"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("italian"))
corpus2 <- tm_map(corpus2, removeWords, stopwords("english"))
corpus2 <- tm_map(corpus2, stripWhitespace)               # Removal of additional spaces
corpus2 <- tm_map(corpus2, stemDocument) #remove suffixes, etc.; only root form of the word

# Further clean the text by removing specific web/text symbols and terms
corpus2 <- tm_map(corpus2, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x)
  x <- gsub("«", "", x)
  x <- gsub("»", "", x)
  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove 'rt', 'www', and 'emojiemoji'
  x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE)  # Remove the standalone token 'amp' left over from HTML-encoded '&'
  x <- gsub("http[s]?://\\S+", "", x)  # Remove URLs (punctuation was already stripped above, so "://" rarely remains)
  return(x)
}))

# Create DTM and remove sparse terms
dtm <- DocumentTermMatrix(corpus2, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm <- removeSparseTerms(dtm, sparse = 0.99)  # Adjust sparsity threshold as needed


Text Analysis:

The word cloud illustrates which topics are most engaging among tweets with at least 20 likes or retweets. This visualization can help in refining communication and engagement strategies by focusing on the topics that naturally engage the audience.

  • “forscherteam” (research team) and “entwickelt” (developed): suggest a strong emphasis on research and development topics.
  • “lab”: indicates discussions possibly related to laboratory work or scientific studies.
  • “data” and “digital”: reflect a focus on digital technologies and data science, crucial in contemporary research and education.
  • “open”: could relate to open source, open access, or openness in research and education, pointing towards transparency and accessibility in academic resources.
  • “nein” (no) and “wieso” (why): might indicate debates or discussions, possibly questioning certain methods or findings.
  • “schweizer” (Swiss): identifies the national or cultural context, implying that the content is likely relevant to or originating from Swiss institutions or discussing Swiss innovations.
  • “gespräch” (conversation): underscores the interactive or dialogical nature of the tweets, suggesting that engagement may be driven by conversational or discursive posts.
  • Not well cleaned elements: Strings like “http” are remnants of URLs. They are not meaningful by themselves but indicate that these tweets include links or specific calls to action.
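set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
top_word_freq <- head(word_freq, 80)
word_names <- names(top_word_freq)  # term labels in the same order as the frequencies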

# Generate word cloud using the correct word names
wordcloud(
  words = word_names, 
  freq = top_word_freq, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)

High Engagement - Emojis

  • Utility and Guidance: Directional emojis like ➡️ and 👉 suggest that providing clear guidance or calls to action within tweets is effective in garnering engagement.
  • Cultural and International Appeal: The presence of multiple national flags suggests that some high-engagement tweets are connected to specific national contexts or international discussions.
  • Emotional and Informative Content: Emojis like ✨ (sparkles) and 💛 (heart) are often used to add emotional depth or positivity to tweets. Similarly, 📅 (calendar) and 📢 (megaphone) likely denote event-related or important announcements that command attention.
  • Caveat: Each emoji appears only once or twice among the high-engagement tweets, so these patterns are tentative.
# Analyze the frequency of different emojis
emoji_freq1 <- table(unlist(high_engagement_tweets$emojis))
sort(emoji_freq1, decreasing = TRUE)
## 
##  ➑️ πŸ‡¨πŸ‡­  ‡️ ✨ πŸ‡¨πŸ‡³ πŸ‡¬πŸ‡§ πŸ‡³πŸ‡± πŸ‡ΈπŸ‡ͺ πŸ‡ΈπŸ‡¬ πŸ‘‰ πŸ’› πŸ“… πŸ“’  πŸ—žοΈ πŸ˜€ πŸ˜‰ 🚊 🚨 
##  2  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1

High Engagement - Hours

This graph shows tweet engagement by hour, indicating that the peak time for high engagement tweets occurs at 16:00 (4 PM). Engagement appears to be generally higher in the afternoon hours compared to the morning and late evening.

# Analyze and plot tweet counts by hour to find the best posting times
best_posting_hours1 <- high_engagement_tweets %>%
  group_by(timeofday_hour) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Plotting
ggplot(best_posting_hours1, aes(x = timeofday_hour, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Tweet Engagement by Hour",
       x = "Hour of the Day",
       y = "Number of High Engagement Tweets") +
  theme_minimal()
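
Because timeofday_hour is stored as text, hours without any high-engagement tweet do not appear in the counts at all. One way to keep all 24 hours visible is to count against a fixed set of factor levels. The sketch assumes zero-padded hour strings such as "08"; the actual format in the data set is not shown in this report, and toy_hours is made up.

```r
# Made-up posting hours of high-engagement tweets
toy_hours <- c("08", "16", "16", "11", "16", "08")

# Count against all 24 hours so that empty hours show up as zero
all_hours <- sprintf("%02d", 0:23)
hour_counts <- table(factor(toy_hours, levels = all_hours))

hour_counts[c("08", "11", "16", "23")]
peak_hour <- names(hour_counts)[which.max(hour_counts)]   # hour with the most tweets
peak_hour
```

Fixed levels also guarantee that the hours are plotted in chronological order rather than in order of appearance.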

High Engagement - Days

The chart shows that Monday and Tuesday are the days with the most high-engagement tweets, indicating these might be optimal days for posting to maximize visibility and interaction. The number of high-engagement tweets noticeably declines as the week progresses and is lowest over the weekend, suggesting less audience activity during these days.

# Extract the day of the week from 'tweet_date'
high_engagement_tweets <- high_engagement_tweets %>%
  mutate(day_of_week = wday(tweet_date, label = TRUE, week_start = 1))  # Adjust 'week_start' if your week starts on a different day

# Analyze and plot tweet counts by day of the week
best_posting_days1 <- high_engagement_tweets %>%
  group_by(day_of_week) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Plotting
ggplot(best_posting_days1, aes(x = day_of_week, y = count)) +
  geom_bar(stat = "identity", fill = "coral") +
  labs(title = "Tweet Engagement by Day of the Week",
       x = "Day of the Week",
       y = "Number of High Engagement Tweets") +
  theme_minimal()
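
A raw count of high-engagement tweets per weekday also reflects how many tweets are posted on each day. To separate the two effects, the share of high-engagement tweets per weekday can be computed. The sketch below uses made-up data and base R; the column names mirror the report, but the numbers are hypothetical.

```r
# Made-up tweets with dates and like counts
toy <- data.frame(
  tweet_date = as.Date(c("2023-01-09", "2023-01-09", "2023-01-09",
                         "2023-01-10", "2023-01-10", "2023-01-14")),
  favorite_count = c(25, 3, 0, 40, 1, 22)
)
threshold <- 20

# ISO weekday number (1 = Monday ... 7 = Sunday), independent of the locale
toy$weekday <- as.integer(format(toy$tweet_date, "%u"))
toy$high <- toy$favorite_count >= threshold

# Share of high-engagement tweets among all tweets posted on that weekday
share_by_day <- tapply(toy$high, toy$weekday, mean)
round(share_by_day, 2)
```

A weekday with few tweets but a high share may be a better posting slot than a weekday that only looks strong because many tweets are posted on it.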

Engagement Analysis by University

The bar chart visualizes the total likes accumulated by each university's tweets with at least 20 likes or retweets, highlighting variations in engagement across these institutions on social media.
HSLU (Lucerne University of Applied Sciences and Arts) and ZHAW (Zurich University of Applied Sciences) stand out with the highest engagement, significantly more than the other institutions. Their approach may therefore offer a pathway for other institutions to refine their social media tactics.

# Analysis of likes and retweets
high_engagement_tweets %>%
  group_by(university) %>%
  summarize(total_likes = sum(favorite_count), total_retweets = sum(retweet_count), .groups = 'drop') %>%
  ggplot(aes(x = reorder(university, total_likes), y = total_likes)) +
  geom_col(fill = "steelblue") +  # geom_col() already uses the values as bar heights, so no `stat` argument is needed
  coord_flip() +
  labs(title = "Engagement Analysis by University", x = "University", y = "Total Likes")
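
Total likes favour accounts with many qualifying tweets. A complementary view is the average number of likes per tweet, sketched here with base R on made-up numbers. The values are hypothetical and only the column names follow the report.

```r
# Made-up high-engagement tweets
toy <- data.frame(
  university = c("hslu", "hslu", "hslu", "ZHAW", "bfh"),
  favorite_count = c(30, 20, 25, 60, 45)
)

# Total and mean likes per university
total_likes <- tapply(toy$favorite_count, toy$university, sum)
mean_likes  <- tapply(toy$favorite_count, toy$university, mean)

data.frame(
  university  = names(total_likes),
  total_likes = as.numeric(total_likes),
  mean_likes  = as.numeric(mean_likes)
)
```

In this toy data the account with the highest total does not have the highest average, which is why both views are worth checking.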

HSLU & ZHAW Engagement Analysis

In this area, we will analyze the universities HSLU (Lucerne University of Applied Sciences and Arts) and ZHAW (Zurich University of Applied Sciences) to find out why they have significantly more interactions compared to other universities.

HSLU & ZHAW Engagement - Terms

Text Preprocessing:

For this, we again apply the text preprocessing steps from the previous analyses.

#Filter Tweets for HSLU and ZHAW
hslu_zhaw_tweets <- filtered_tweets %>%
  filter(university %in% c("hslu", "ZHAW"))

# Set a threshold for "high engagement" (e.g., tweets with at least 20 likes or retweets)
engagement_threshold1 <- 20

# Filter tweets based on this engagement threshold
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_tweets %>%
  filter(favorite_count >= engagement_threshold1 | retweet_count >= engagement_threshold1)

# Rebuild the corpus with the sampled data
corpus3 <- Corpus(VectorSource(hslu_zhaw_high_engagement_tweets$clean_text)) 

corpus3 <- tm_map(corpus3, content_transformer(tolower))  # Convert to lower case
corpus3 <- tm_map(corpus3, removePunctuation)             # Removing punctuation marks
corpus3 <- tm_map(corpus3, removeNumbers)                 # Removing numbers
corpus3 <- tm_map(corpus3, removeWords, stopwords("german"))  # Removing stop words
corpus3 <- tm_map(corpus3, removeWords, stopwords("french"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("italian"))
corpus3 <- tm_map(corpus3, removeWords, stopwords("english"))
corpus3 <- tm_map(corpus3, stripWhitespace)               # Removal of additional spaces
corpus3 <- tm_map(corpus3, stemDocument) #remove suffixes, etc.; only root form of the word

# Further clean the text by removing specific web/text symbols and terms
corpus3 <- tm_map(corpus3, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x)
  x <- gsub("«", "", x)
  x <- gsub("»", "", x)
  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove 'rt', 'www', and 'emojiemoji'
  x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE)  # Remove the standalone token 'amp' left over from HTML-encoded '&'
  x <- gsub("http[s]?://\\S+", "", x)  # Remove URLs (punctuation was already stripped above, so "://" rarely remains)
  return(x)
}))

# Create DTM and remove sparse terms
dtm2 <- DocumentTermMatrix(corpus3, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm2 <- removeSparseTerms(dtm2, sparse = 0.99)  # Adjust sparsity threshold as needed


Text Analysis:

The word cloud illustrates which topics are most engaging among tweets with at least 20 likes or retweets.

  • Environmental Focus: Terms like “Klimaziel” suggest discussions around climate goals, indicating a strong environmental or sustainability focus within the discourse.
  • Economic Impact: Words such as “profitieren” and “Beitrag” highlight discussions on economic benefits and contributions, potentially related to how environmental goals can align with economic gains.
  • Educational Context: The presence of “ZHAW” directly ties the content to the Zurich University of Applied Sciences, suggesting these topics are relevant to university-led discussions or initiatives.
  • National Relevance: The inclusion of “Schweiz” ties the discussions to Switzerland, indicating that these topics are of national interest, potentially discussing Swiss policies or initiatives regarding sustainability.
  • Not well cleaned elements: Strings like “httpstcoezfrwxu” are remnants of URLs. They are not meaningful by themselves but indicate that these tweets include links or specific calls to action.
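set.seed(123)
# Term frequencies are the column sums of the DTM (rows are documents, columns are terms)
word_freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
top_word_freq2 <- head(word_freq2, 80)
word_names2 <- names(top_word_freq2)  # term labels in the same order as the frequencies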

# Generate word cloud using the correct word names
wordcloud(
  words = word_names2, 
  freq = top_word_freq2, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)

HSLU & ZHAW Engagement - Emojis

  • Direction: The use of “right arrow” (➡️) or “right-pointing finger” (👉) suggests a focus on direction or continuation, potentially indicating links or further content.
  • Positive Emojis: The inclusion of positive emojis like “yellow heart” (💛) or “smiley face” (😀) indicates a friendly, positive communication style.
  • Local Topics: The Swiss flag (🇨🇭) might be used in contexts relating to national pride or local topics. Overall, these emojis contribute to engaging and positive social media interactions, which could be part of why these universities have higher engagement rates.
# Analyze the frequency of different emojis
emoji_freq <- table(unlist(hslu_zhaw_high_engagement_tweets$emojis))
sort(emoji_freq, decreasing = TRUE)
## 
##  ➡️ 🇨🇭 👉 💛 😀 😉 
##  2  1  1  1  1  1
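
Because the counts above are small, it can help to report how many tweets contain any emoji at all alongside the per-emoji counts. The sketch below shows the `table(unlist(...))` pattern on a hypothetical list-column. The object `toy_emojis` and its contents are invented for illustration.

```r
# Hypothetical list-column: one character vector of emojis per tweet
toy_emojis <- list(
  c("\U0001F449", "\U0001F1E8\U0001F1ED"),  # pointing finger, Swiss flag
  character(0),                              # tweet without emojis
  c("\U0001F449"),
  c("\U0001F49B", "\U0001F449")              # yellow heart, pointing finger
)

# Flatten the list and count each emoji
toy_emoji_freq <- sort(table(unlist(toy_emojis)), decreasing = TRUE)

# Share of tweets that contain at least one emoji
share_with_emoji <- mean(lengths(toy_emojis) > 0)

print(toy_emoji_freq)
print(share_with_emoji)
```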

HSLU & ZHAW Engagement - Hours

Early morning (08:00) and late afternoon to early evening (17:00 and 18:00) are the most effective times to post content that is likely to get high engagement. Timing posts to align with these peak periods could enhance visibility and interaction for the universities' social media content.

# Analyze and plot tweet counts by hour to find the best posting times
best_posting_hours <- hslu_zhaw_high_engagement_tweets %>%
  group_by(timeofday_hour) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Plotting
ggplot(best_posting_hours, aes(x = timeofday_hour, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Tweet Engagement by Hour for HSLU and ZHAW",
       x = "Hour of the Day",
       y = "Number of High Engagement Tweets") +
  theme_minimal()
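
Raw counts of high-engagement tweets per hour also reflect how many tweets are posted in that hour at all. One possible refinement, sketched below on invented data, is to look at the share of tweets per hour that reach high engagement. The column names `hour` and `high_engagement` are assumptions made for this example and do not refer to columns in the dataset.

```r
# Toy data: posting hour and a flag for high engagement (values invented)
toy_tweets <- data.frame(
  hour = c(8, 8, 8, 8, 12, 12, 17, 17, 17, 18),
  high_engagement = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
                      TRUE, TRUE, FALSE, TRUE)
)

# Number of tweets posted in each hour
total_per_hour <- table(toy_tweets$hour)

# Share of tweets in each hour that reached high engagement
# (the mean of a logical vector is the proportion of TRUE values)
rate_per_hour <- tapply(toy_tweets$high_engagement, toy_tweets$hour, mean)

print(total_per_hour)
print(round(rate_per_hour, 2))
```

In this toy example, 08:00 has the most tweets but only a quarter of them reach high engagement, so a count-based and a rate-based ranking can disagree.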

HSLU & ZHAW Engagement - Days

This graph illustrates the distribution of high engagement tweets for HSLU and ZHAW by day of the week, showing that Tuesday is the most effective day to post on Twitter for maximizing engagement at these universities. The sharp drop in engagement over the weekend further supports the trend that weekdays, particularly the beginning of the week, are optimal for reaching the audience.

# Extract the day of the week from 'tweet_date'
hslu_zhaw_high_engagement_tweets <- hslu_zhaw_high_engagement_tweets %>%
  mutate(day_of_week = wday(tweet_date, label = TRUE, week_start = 1))  # Adjust 'week_start' if your week starts on a different day

# Analyze and plot tweet counts by day of the week
best_posting_days <- hslu_zhaw_high_engagement_tweets %>%
  group_by(day_of_week) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Plotting
ggplot(best_posting_days, aes(x = day_of_week, y = count)) +
  geom_bar(stat = "identity", fill = "coral") +
  labs(title = "Tweet Engagement by Day of the Week for HSLU and ZHAW",
       x = "Day of the Week",
       y = "Number of High Engagement Tweets") +
  theme_minimal()

BFH Frequency Analysis

In this section, we analyze the tweets of BFH (Bern University of Applied Sciences) to find out which terms and emojis it uses most often and when it usually posts.

Text Preprocessing:

For this, we again apply the text preprocessing steps used in the previous analyses.

# Filter tweets for BFH
bfh_tweets <- filtered_tweets %>%
  filter(university == "bfh")

# Build the corpus from the BFH tweets
corpus4 <- Corpus(VectorSource(bfh_tweets$clean_text)) 

corpus4 <- tm_map(corpus4, content_transformer(tolower))  # Convert to lower case
corpus4 <- tm_map(corpus4, removePunctuation)             # Removing punctuation marks
corpus4 <- tm_map(corpus4, removeNumbers)                 # Removing numbers
corpus4 <- tm_map(corpus4, removeWords, stopwords("german"))  # Removing stop words
corpus4 <- tm_map(corpus4, removeWords, stopwords("french"))
corpus4 <- tm_map(corpus4, removeWords, stopwords("italian"))
corpus4 <- tm_map(corpus4, removeWords, stopwords("english"))
corpus4 <- tm_map(corpus4, stripWhitespace)               # Removal of additional spaces
corpus4 <- tm_map(corpus4, stemDocument) #remove suffixes, etc.; only root form of the word

# Further clean the text by removing specific web/text symbols and terms
corpus4 <- tm_map(corpus4, content_transformer(function(x) {
  x <- gsub("–", "", x)
  x <- gsub("…", "", x) 
  x <- gsub("«", "", x) 
  x <- gsub("»", "", x) 
  x <- gsub("\\b(rt|www|emojiemoji)\\b", "", x, ignore.case = TRUE)  # Remove 'rt', 'www', and 'emojiemoji'
  x <- gsub("\\bamp\\b", "", x, ignore.case = TRUE)  # Remove standalone 'amp' left over from HTML-encoded '&'
  x <- gsub("http[s]?://\\S+", "", x)  # Remove URLs (only matches while '://' is still intact)
  return(x)
}))

# Create DTM and remove sparse terms
dtm3 <- DocumentTermMatrix(corpus4, control = list(removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(1, Inf)))
dtm3 <- removeSparseTerms(dtm3, sparse = 0.99)  # Adjust sparsity threshold as needed
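
The order of the cleaning steps matters. The following base R sketch, using an invented example tweet, shows why URL remnants can survive when punctuation is removed before URLs, and why an unanchored pattern such as "amp" can damage ordinary words. It only approximates what the tm functions do and is not the pipeline used above.

```r
# Example tweet text (invented for illustration)
raw <- "Neues Projekt am Campus &amp; mehr: https://t.co/abc123 #BFH"

# Order A: strip punctuation first, then try to remove URLs
a <- gsub("[[:punct:]]", "", raw)
a <- gsub("http[s]?://\\S+", "", a)   # no longer matches: "://" is gone

# Order B: remove URLs and the HTML entity first, then punctuation
b <- gsub("http[s]?://\\S+", "", raw)
b <- gsub("&amp;", "", b, fixed = TRUE)
b <- gsub("[[:punct:]]", "", b)
b <- gsub("\\s+", " ", trimws(b))     # collapse leftover whitespace

print(a)
print(b)

# An unanchored pattern also hits the middle of words
damaged <- gsub("amp", "", "campus")
intact  <- gsub("\\bamp\\b", "", "campus")
print(c(damaged, intact))
```

Order A leaves a token like "httpstcoabc123" in the text, which matches the kind of artifact seen in the word clouds.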


Text Analysis:

The word cloud illustrates which terms occur most frequently.

  • Practical and Innovative Focus: Terms like “Praxis” (practice) and “Innov” (innovation) indicate a strong link between academic content and real-world applications, appealing particularly to an audience interested in actionable and cutting-edge information.
  • Community and Collaboration: Words such as “zusammen” (together) and “unsere” (our) reflect a community-focused approach, promoting collective efforts and teamwork within the university setting.
  • Local Identity and Quality: The mention of “Schweizer” (Swiss) suggests content with a national focus, likely resonating with local pride, while “Qualität” (quality) underscores the university’s commitment to high standards in education and research.
set.seed(123)
# Term frequencies: sum each term (column) of the document-term matrix
word_freq3 <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
top_word_freq3 <- head(word_freq3, 80)
word_names3 <- names(top_word_freq3)  # Names stay aligned with the top frequencies

# Generate word cloud using the correct word names
wordcloud(
  words = word_names3, 
  freq = top_word_freq3, 
  max.words = 80,
  scale = c(4, 0.5),       # Control for size of the most and least frequent words
  random.order = FALSE,    # Higher frequency words appear first
  rot.per = 0.25,          # Allows some rotation for fitting
  colors = brewer.pal(8, "Dark2")  # Enhances visual appeal
)


Most Frequent Emojis:

  • Technology and Innovation: Directional emojis like 👉, devices such as 💻 and innovation symbols (🔋, 🚀, 🤖) dominate, highlighting content on technological advancements and future trends.
  • Environmental Themes: Nature-related emojis (🌴, 🌲, 🌳, ♻️) emphasize environmental issues and sustainability efforts.
  • Community and Celebrations: Emojis like 🎉 and 👏 are used for celebrations and achievements, fostering community spirit.
  • Health and Lifestyle: Emojis like 🥥, 🥦, and 🥕 suggest a focus on health and nutrition.
  • Global and Cultural Awareness: Symbols like 🌐 and 🌍, along with the Swiss flag 🇨🇭, point to global awareness and local identity.
# Analyze the frequency of different emojis
emoji_freq3 <- table(unlist(bfh_tweets$emojis))
sort(emoji_freq3, decreasing = TRUE)[1:30]
## 
##   👉   🔋   👇   🌴   🌲   🎉   💡   💻   🚀   🤖   🇨🇭   🌳   👏   📅   🥥   🌱 
##   49   16   12   11   10   10   10   10   10   10    9    9    9    9    9    8 
##   🚗   🥂   ✨   🌐    ♻️   🎄   🐝   🥦    ☀️   🌍   🏡   🐴 👨‍🎓   🥕 
##    8    8    7    7    6    6    6    6    5    5    5    5    5    5

BFH Engagement - Hours

The highest volume of tweets is sent between 08:00 and 09:00, with a notable peak also around 06:00. There’s a gradual decline in tweet activity as the day progresses, especially after 17:00, indicating lower activity in the evening.

# Analyze and plot tweet counts by hour to find the best posting times
best_posting_hours2 <- bfh_tweets %>%
  group_by(timeofday_hour) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Plotting
ggplot(best_posting_hours2, aes(x = timeofday_hour, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Tweet Frequency by Hour for BFH",
       x = "Hour of the Day",
       y = "Number of Tweets") +
  theme_minimal()

BFH Engagement - Days

The data shows a consistently high level of tweeting activity from Monday through Friday, with the peak on Wednesday, followed by a sharp decline during the weekend.

# Extract the day of the week from 'tweet_date'
bfh_tweets <- bfh_tweets %>%
  mutate(day_of_week = wday(tweet_date, label = TRUE, week_start = 1))  # Adjust 'week_start' if your week starts on a different day

# Analyze and plot tweet counts by day of the week
best_posting_days2 <- bfh_tweets %>%
  group_by(day_of_week) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(desc(count))

# Plotting
ggplot(best_posting_days2, aes(x = day_of_week, y = count)) +
  geom_bar(stat = "identity", fill = "coral") +
  labs(title = "Tweet Frequency by Day of the Week for BFH",
       x = "Day of the Week",
       y = "Number of Tweets") +
  theme_minimal()

Recommendations

In this section, we outline the most important recommendations for BFH derived from the previous analyses. BFH can use these measures to increase interactions on its Twitter account.

Optimal Terms

  • Focus on digital and innovative topics: Words like “digital”, “digit”, “data”, and “open” show that topics around digitalization and open data initiatives attract a lot of interest.

  • Emphasize sustainability and climate targets: Terms like “climate goal” and “sustain” show that discussions around sustainability and environmental responsibility resonate. BFH already partly covers this theme.

  • Use interactive elements: Terms that invite dialogue, such as “gespräch” (conversation), can help increase interactivity and community engagement.

  • Expand and include the target audience: Words like “chunivers” (CH universities) suggest that a broader discussion on topics that affect multiple universities will resonate.


BFH should take up these topics more often in order to achieve more interactions on its account.

Optimal Emojis

Emojis such as ➡️, 🇨🇭, 💛, 😀 and 😉 should be used more often, as they were the most common emojis in the high-engagement tweets of the other universities. The emoji 👉 is already widely used by BFH and should continue to be used.

Optimal Posting Times

The analysis shows that tweets generate the most engagement in the early morning (around 08:00) and in the late afternoon (17:00 to 18:00). It would be advisable to schedule important announcements and content during these times.

Optimal Posting Weekday

The engagement analysis by day of the week shows that engagement is generally higher from Monday to Thursday than on weekends. Tuesday in particular receives the most interactions on the accounts. BFH should consider focusing its main communication on these days, especially Tuesday.
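
The hour and weekday recommendations were derived separately. One possible way to check whether they also hold in combination is to cross-tabulate weekday and hour, as in the following base R sketch. The data frame `toy_slots` and all its values are invented for illustration. With real data, the same idea could be applied to the high-engagement tweets.

```r
# Toy high-engagement tweets with weekday and hour (values invented)
toy_slots <- data.frame(
  day  = factor(c("Mon", "Tue", "Tue", "Tue", "Wed", "Thu", "Fri", "Sat"),
                levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")),
  hour = c(8, 8, 17, 8, 18, 17, 8, 12)
)

# Cross-tabulate weekday by hour
slot_counts <- table(toy_slots$day, toy_slots$hour)

# Find the weekday/hour combination(s) with the most tweets
best <- which(slot_counts == max(slot_counts), arr.ind = TRUE)
best_day  <- rownames(slot_counts)[best[, 1]]
best_hour <- colnames(slot_counts)[best[, 2]]

print(slot_counts)
print(data.frame(day = best_day, hour = best_hour))
```

With only a handful of high-engagement tweets per cell, such a table is best read as a rough guide rather than as firm evidence for a single best slot.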

Conclusion

Our findings suggest that high tweet engagement is associated with content focused on digital transformation, sustainability, and institutional activities. Notably, tweets featuring themes of innovation, open data, and environmental initiatives received more engagement, reflecting a broader interest in these areas among the audience.
Furthermore, the timing of posts plays an important role in maximizing visibility and interaction. The data indicated specific hours and days on which engagement peaked, suggesting optimal times for posting to ensure maximum reach.
By implementing these recommendations, BFH can increase engagement on its social media platforms and possibly also reinforce its position as a forward-thinking and impactful educational institution in Switzerland.